library(tidyverse)
library(stringr)
library(ggpubr)
library(knitr)
If you don’t keep your data in the same directory as the code, adapt the path names.
dir1 <- "~"
dir2 <- "Desktop"
dir3 <- "AC"
dir4 <- "Useful"
dir5 <- "Carrer"
dir6 <- "Skills "
dir7 <- "3. Skills para trabajo"
dir8 <- "10. R data science, statistics, machine learning"
dir9 <- "Portfolio analysis"
dir10 <- "3. Marketing Analytics"
file_name <- "Data"
PSDS_PATH <- file.path(dir1, dir2, dir3, dir4, dir5, dir6, dir7, dir8, dir9, dir10, file_name)
Data <- read_csv(file.path(PSDS_PATH, 'ml_project1_data.csv'))
Data <- arrange(Data, ID)
It would be useful to have a feature with the clients' age.
Since the last date in the data is in June 2014, we assume for the age calculations that this analysis is being performed in 2014.
Data <- Data %>%
mutate(Age = 2014 - Year_Birth)
str(Data)
## tibble [2,240 × 30] (S3: tbl_df/tbl/data.frame)
## $ ID : num [1:2240] 0 1 9 13 17 20 22 24 25 35 ...
## $ Year_Birth : num [1:2240] 1985 1961 1975 1947 1971 ...
## $ Education : chr [1:2240] "Graduation" "Graduation" "Master" "PhD" ...
## $ Marital_Status : chr [1:2240] "Married" "Single" "Single" "Widow" ...
## $ Income : num [1:2240] 70951 57091 46098 25358 60491 ...
## $ Kidhome : num [1:2240] 0 0 1 0 0 0 1 1 0 1 ...
## $ Teenhome : num [1:2240] 0 0 1 1 1 1 0 1 1 0 ...
## $ Dt_Customer : Date[1:2240], format: "2013-05-04" "2014-06-15" ...
## $ Recency : num [1:2240] 66 0 86 57 81 91 99 96 9 35 ...
## $ MntWines : num [1:2240] 239 464 57 19 637 43 185 18 460 32 ...
## $ MntFruits : num [1:2240] 10 5 0 0 47 12 2 2 35 1 ...
## $ MntMeatProducts : num [1:2240] 554 64 27 5 237 23 88 19 422 64 ...
## $ MntFishProducts : num [1:2240] 254 7 0 0 12 29 15 0 33 16 ...
## $ MntSweetProducts : num [1:2240] 87 0 0 0 19 15 5 2 12 12 ...
## $ MntGoldProds : num [1:2240] 54 37 36 8 76 61 14 6 153 85 ...
## $ NumDealsPurchases : num [1:2240] 1 1 4 2 4 1 2 5 2 3 ...
## $ NumWebPurchases : num [1:2240] 3 7 3 1 6 2 6 3 6 2 ...
## $ NumCatalogPurchases: num [1:2240] 4 3 2 0 11 1 1 0 6 2 ...
## $ NumStorePurchases : num [1:2240] 9 7 2 3 7 4 5 4 7 3 ...
## $ NumWebVisitsMonth : num [1:2240] 1 5 8 6 5 4 8 7 4 6 ...
## $ AcceptedCmp3 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp4 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp1 : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : num [1:2240] 0 1 0 0 0 0 0 0 0 0 ...
## $ Complain : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
## $ Z_CostContact : num [1:2240] 3 3 3 3 3 3 3 3 3 3 ...
## $ Z_Revenue : num [1:2240] 11 11 11 11 11 11 11 11 11 11 ...
## $ Response : num [1:2240] 0 1 0 0 0 0 0 0 0 1 ...
## $ Age : num [1:2240] 29 53 39 67 43 49 38 54 56 27 ...
# Unique categories in each categorical column
unique(Data$Education)
## [1] "Graduation" "Master" "PhD" "2n Cycle" "Basic"
unique(Data$Marital_Status)
## [1] "Married" "Single" "Widow" "Divorced" "Together" "Alone" "YOLO"
## [8] "Absurd"
As can be observed, there are no trailing or leading spaces, misspellings, or blank values; therefore, this portion of the data is clean.
| Group | Range | Data_Type |
|---|---|---|
| ID | 0-11191 | Categorical nominal |
| Age | 18-121 | Numeric discrete |
| Income | 1730-666666 | Numeric continuous |
| Kidhome | 0-2 | Numeric discrete |
| Teenhome | 0-2 | Numeric discrete |
| Dt_Customer | 2012-07-30 to 2014-06-29 | Date |
| Recency | 0-99 | Numeric discrete |
| MntWines | 0-1493 | Numeric continuous |
| MntFruits | 0-199 | Numeric continuous |
| MntMeatProducts | 0-1725 | Numeric continuous |
| MntFishProducts | 0-259 | Numeric continuous |
| MntSweetProducts | 0-263 | Numeric continuous |
| MntGoldProds | 0-362 | Numeric continuous |
| NumDealsPurchases | 0-15 | Numeric discrete |
| NumWebPurchases | 0-27 | Numeric discrete |
| NumCatalogPurchases | 0-28 | Numeric discrete |
| NumStorePurchases | 0-13 | Numeric discrete |
| NumWebVisitsMonth | 0-20 | Numeric discrete |
| AcceptedCmp1 | 0-1 | Categorical nominal |
| AcceptedCmp2 | 0-1 | Categorical nominal |
| AcceptedCmp3 | 0-1 | Categorical nominal |
| AcceptedCmp4 | 0-1 | Categorical nominal |
| AcceptedCmp5 | 0-1 | Categorical nominal |
| Complain | 0-1 | Categorical nominal |
| Z_CostContact | 3-3 | Constant (uninformative) |
| Z_Revenue | 11-11 | Constant (uninformative) |
| Response | 0-1 | Categorical nominal |
| Education | - | Categorical nominal |
| Marital Status | - | Categorical nominal |
It can be observed that the ranges of the numerical data are as expected; therefore, this portion of the data is clean.
duplicates <- duplicated(Data$ID)
num_true <- sum(duplicates)
print(num_true)
## [1] 0
remove(duplicates,num_true)
We can conclude that there are no duplicates.
Before we start, it would be useful to know how many responses we had in our test.
We know that there are 2240 observations and that Response is a binary variable (0 or 1).
table(Data$Response)
##
## 0 1
## 1906 334
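For context, the overall conversion rate follows directly from these counts; a quick sketch (since Response is 0/1, its mean is the response rate):

```r
# Overall response rate for campaign 6: the mean of a 0/1 column is the share of 1s
response_rate <- mean(Data$Response)
round(100 * response_rate, 1)  # 334 / 2240, roughly 14.9%
```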
I’ll start by exploring the categorical features (Education and Marital_Status).
We noticed a pattern worth studying further:
* People with PhDs tend to respond better to the ad.
In the next section (Statistical analysis), we will determine whether this difference is statistically significant and decide whether we can use it in our model to predict the outcome.
We noticed a pattern worth studying further:
* When customers are not in a relationship ("Single", "Widow", "Divorced"), their response to the ad increases.
* The opposite is also true: when customers are in a relationship ("Married" or "Together"), their response to the ad decreases.
In the next section (Statistical analysis), we will determine whether these differences are statistically significant and decide whether we can use them in our model to predict the outcome.
We noticed a pattern worth studying further:
* People who complain are far more likely not to respond to campaign 6.
In the next section (Statistical analysis), we will determine whether this difference is statistically significant and decide whether we can use it in our model to predict the outcome.
We noticed a pattern worth studying further:
* People who responded positively to our previous campaigns are more likely to respond positively to campaign 6.
In the next section (Statistical analysis), we will determine whether this difference is statistically significant and decide whether we can use it in our model to predict the outcome.
Now it's time to study the numerical discrete features (Kidhome, Teenhome, AcceptedCmp[1-5], Complain); these can be explored similarly to the categorical features.
We noticed a pattern worth studying further:
* The more kids people have, the less likely they are to respond to our ads.
In the next section (Statistical analysis), we will determine whether this difference is statistically significant and decide whether we can use it in our model to predict the outcome.
We noticed a pattern worth studying further:
* The more teens people have, the less likely they are to respond to our ads.
In the next section (Statistical analysis), we will determine whether this difference is statistically significant and decide whether we can use it in our model to predict the outcome.
We notice a few outliers above 100 years of age; perhaps these people found the cure for death, but just to be safe we decided to exclude them.
Data <- Data %>%
filter(Age < 100)
As observed in the boxplot, there is minimal difference in age between people who responded positively to our ad and those who didn't.
Therefore, we will not pursue a statistical analysis nor use this feature in the model to predict the response.
We noticed a pattern worth studying further:
* People with lower recency tend to respond better to the ad.
In the next section (Statistical analysis), we will determine whether this difference is statistically significant and decide whether we can use it in our model to predict the outcome.
We notice a few outliers above 7.5, so we decided to exclude them.
Data <- Data %>%
filter(NumDealsPurchases < 7.5)
As observed in the boxplot, there is minimal difference in NumDealsPurchases between people who responded positively to our ad and those who didn't.
Therefore, we will not pursue a statistical analysis nor use this feature in the model to predict the response.
We notice a few outliers above 15, so we decided to exclude them.
Data <- Data %>%
filter(NumWebPurchases < 15)
We noticed a pattern worth studying further:
* People with higher NumWebPurchases tend to respond better to the ad.
In the next section (Statistical analysis), we will determine whether this difference is statistically significant and decide whether we can use it in our model to predict the outcome.
We notice a few outliers above 15, so we decided to exclude them.
Data <- Data %>%
filter(NumCatalogPurchases < 15)
We noticed a pattern worth studying further:
* People with higher NumCatalogPurchases tend to respond better to the ad.
In the next section (Statistical analysis), we will determine whether this difference is statistically significant and decide whether we can use it in our model to predict the outcome.
As observed in the boxplot, there is minimal difference in NumStorePurchases between people who responded positively to our ad and those who didn't.
Therefore, we will not pursue a statistical analysis nor use this feature in the model to predict the response.
We notice a few outliers above 12, so we decided to exclude them.
Data <- Data %>%
filter(NumWebVisitsMonth < 12)
We noticed a pattern worth studying further:
* The distribution for people who respond better to the ad is wider; the observations closer to the tails can therefore help our model predict the outcome better.
In the next section (Statistical analysis), we will determine whether this difference is statistically significant and decide whether we can use it in our model to predict the outcome.
We noticed a pattern worth studying further:
* The longer a client has been enrolled with the company, the better they respond to the ad.
In the next section (Statistical analysis), we will determine whether this difference is statistically significant and decide whether we can use it in our model to predict the outcome.
Now it's time to study the numerical continuous features.
We notice a few outliers above 140000, so we decided to exclude them:
Data <- Data %>%
filter(Income < 140000)
We noticed a pattern worth studying further:
* The higher the client's income, the better the client responds to the ad.
In the next section (Statistical analysis), we will determine whether this difference is statistically significant and decide whether we can use it in our model to predict the outcome.
We noticed a pattern worth studying further:
* Overall, the more clients spend in any product category, the more likely they are to respond to our ad.
In the next section (Statistical analysis), we will determine whether these differences are statistically significant and decide whether we can use them in our model to predict the outcome.
Before we start the statistical tests, we created a table defining which variables we will test.
| Group | Pursue_a_statistical_analysis |
|---|---|
| ID | NO |
| Age | NO |
| Income | YES |
| Kidhome | YES |
| Teenhome | YES |
| Dt_Customer | YES |
| Recency | YES |
| MntWines | YES |
| MntFruits | YES |
| MntMeatProducts | YES |
| MntFishProducts | YES |
| MntSweetProducts | YES |
| MntGoldProds | YES |
| NumDealsPurchases | NO |
| NumWebPurchases | YES |
| NumCatalogPurchases | YES |
| NumStorePurchases | NO |
| NumWebVisitsMonth | YES |
| AcceptedCmp1 | YES |
| AcceptedCmp2 | YES |
| AcceptedCmp3 | YES |
| AcceptedCmp4 | YES |
| AcceptedCmp5 | YES |
| Complain | YES |
| Z_CostContact | NO |
| Z_Revenue | NO |
| Education | YES |
| Marital Status | YES |
Also, it would be useful to filter the Data by response for the tests.
Data_0 <- Data %>%
filter(Response == 0)
nrow(Data_0)
## [1] 1830
Data_1 <- Data %>%
filter(Response == 1)
nrow(Data_1)
## [1] 323
The Data_1 dataframe has 323 observations, and the Data_0 dataframe has 1830.
In this test we compare the difference in medians between the Response groups (0 vs 1) to determine whether our findings are statistically significant or due to random chance.
To keep the analysis short, we show the full calculations only for the first feature (Income), then report just the results for the remaining features.
median_Income_0 <- median(Data_0$Income)
median_Income_0
## [1] 49724
median_Income_1 <- median(Data_1$Income)
median_Income_1
## [1] 64509
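The p-value reported below comes from a permutation test on the difference in medians; the exact chunk is not shown here, but a minimal sketch of such a test could look like this (the seed and the 1000 permutations are illustrative choices):

```r
# Permutation test: could the observed median-income gap arise by chance?
set.seed(42)  # illustrative seed for reproducibility
obs_diff <- median_Income_1 - median_Income_0

perm_diffs <- replicate(1000, {
  shuffled <- sample(Data$Response)  # break any real link between label and income
  median(Data$Income[shuffled == 1]) - median(Data$Income[shuffled == 0])
})

# p-value: share of permuted gaps at least as large as the observed one
mean(perm_diffs >= obs_diff)
```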
## [1] 0
The result is a p-value of 0, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that Income could be a potential feature for the model to make predictions.
## [1] 0
The result is a p-value of 0, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that MntWines could be a potential feature for the model to make predictions.
## [1] 0
The result is a p-value of 0, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that MntFruits could be a potential feature for the model to make predictions.
## [1] 0
The result is a p-value of 0, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that MntMeatProducts could be a potential feature for the model to make predictions.
## [1] 0
The result is a p-value of 0, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that MntFishProducts could be a potential feature for the model to make predictions.
## [1] 0
The result is a p-value of 0, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that MntSweetProducts could be a potential feature for the model to make predictions.
## [1] 0
The result is a p-value of 0, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that MntGoldProds could be a potential feature for the model to make predictions.
In this test we compare the counts between the Response groups (0 vs 1) to determine whether our findings are statistically significant or due to random chance.
To keep the analysis short, we show the full calculations only for the first feature (AcceptedCmp1), then report just the results for the remaining features.
We start by making a table with the counts.
## # A tibble: 4 × 3
## # Groups: AcceptedCmp1, Response [4]
## AcceptedCmp1 Response n
## <dbl> <dbl> <int>
## 1 0 0 1768
## 2 0 1 244
## 3 1 1 79
## 4 1 0 62
Then we calculate the conversion percentage for people who accepted Cmp1 vs people who didn't, and subtract the two.
obs_pct_diff_Cmp1 <- 100 * (79/141 -244/2012) #%conv1 - %conv2
obs_pct_diff_Cmp1
## [1] 43.90113
Now we are going to determine whether this difference is statistically significant.
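The chunk computing this p-value is not shown here; a minimal sketch of a permutation test for the percentage difference (seed and permutation count are illustrative) could look like this:

```r
# Pool all Response values, then repeatedly deal out a random "accepted" group
set.seed(42)  # illustrative seed
responses <- Data$Response
n_accepted <- 141  # customers who accepted Cmp1 (79 + 62)

perm_diffs <- replicate(1000, {
  idx <- sample(length(responses), n_accepted)
  100 * (mean(responses[idx]) - mean(responses[-idx]))
})

# p-value: share of permuted differences at least as large as the observed one
mean(perm_diffs >= obs_pct_diff_Cmp1)
```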
## [1] 43.90113
## [1] 0
The result is a p-value of 0, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that AcceptedCmp1 could be a potential feature for the model to make predictions.
We start by making a table with the counts.
## # A tibble: 4 × 3
## # Groups: AcceptedCmp2, Response [4]
## AcceptedCmp2 Response n
## <dbl> <dbl> <int>
## 1 0 0 1820
## 2 0 1 303
## 3 1 1 20
## 4 1 0 10
Then we calculate the conversion percentage for people who accepted Cmp2 vs people who didn't, and subtract the two.
obs_pct_diff_Cmp2 <- 100 * (20/30 -303/2123) #%conv1 - %conv2 of response by accepted cmp
obs_pct_diff_Cmp2
## [1] 52.39441
## [1] 52.39441
## [1] 0
The result is a p-value of 0, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that AcceptedCmp2 could be a potential feature for the model to make predictions.
We start by making a table with the counts.
## # A tibble: 4 × 3
## # Groups: AcceptedCmp3, Response [4]
## AcceptedCmp3 Response n
## <dbl> <dbl> <int>
## 1 0 0 1745
## 2 0 1 246
## 3 1 0 85
## 4 1 1 77
Then we calculate the conversion percentage for people who accepted Cmp3 vs people who didn't, and subtract the two.
obs_pct_diff_Cmp3 <- 100 * (77/162 -246/1991) #%conv1 - %conv2 of response by accepted cmp
obs_pct_diff_Cmp3
## [1] 35.17526
## [1] 35.17526
## [1] 0
The result is a p-value of 0, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that AcceptedCmp3 could be a potential feature for the model to make predictions.
We start by making a table with the counts.
## # A tibble: 4 × 3
## # Groups: AcceptedCmp4, Response [4]
## AcceptedCmp4 Response n
## <dbl> <dbl> <int>
## 1 0 0 1730
## 2 0 1 264
## 3 1 0 100
## 4 1 1 59
Then we calculate the conversion percentage for people who accepted Cmp4 vs people who didn't, and subtract the two.
obs_pct_diff_Cmp4 <- 100 * (59/159 -264/1994) #%conv1 - %conv2 of response by accepted cmp
obs_pct_diff_Cmp4
## [1] 23.8672
## [1] 23.8672
## [1] 0
The result is a p-value of 0, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that AcceptedCmp4 could be a potential feature for the model to make predictions.
We start by making a table with the counts.
## # A tibble: 4 × 3
## # Groups: AcceptedCmp5, Response [4]
## AcceptedCmp5 Response n
## <dbl> <dbl> <int>
## 1 0 0 1760
## 2 0 1 232
## 3 1 1 91
## 4 1 0 70
Then we calculate the conversion percentage for people who accepted Cmp5 vs people who didn't, and subtract the two.
obs_pct_diff_Cmp5 <- 100 * (91/161 -232/1992) #%conv1 - %conv2 of response by accepted cmp
obs_pct_diff_Cmp5
## [1] 44.87515
## [1] 44.87515
## [1] 0
The result is a p-value of 0, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that AcceptedCmp5 could be a potential feature for the model to make predictions.
We start by making a table with the counts.
## # A tibble: 4 × 3
## # Groups: Complain, Response [4]
## Complain Response n
## <dbl> <dbl> <int>
## 1 0 0 1813
## 2 0 1 320
## 3 1 0 17
## 4 1 1 3
Then we calculate the conversion percentage for people who complained vs people who didn't, and subtract the two.
obs_pct_diff_Complain <- 100 * (320/2133-3/20) #%conv1 - %conv2 of response by accepted cmp
obs_pct_diff_Complain
## [1] 0.002344116
## [1] 0.002344116
## [1] 0.3161538
The result is a p-value of 0.316, meaning a difference this large could arise by random chance about 31.6% of the time. We therefore fail to reject the null hypothesis and conclude that this result is not statistically significant.
Therefore, we will not consider this feature for the model to make predictions.
In this test we compare the counts between the Response groups (0 vs 1) to determine whether our findings are statistically significant or due to random chance.
To keep the analysis short, we show the full calculations only for the first feature (Kidhome), then report just the results for the remaining features.
We start by making a table with the counts.
## # A tibble: 6 × 3
## # Groups: Kidhome, Response [6]
## Kidhome Response n
## <dbl> <dbl> <int>
## 1 0 0 1043
## 2 0 1 220
## 3 1 0 743
## 4 1 1 101
## 5 2 0 44
## 6 2 1 2
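The Count_Kidhome table above comes from a grouping chunk that is not shown; a minimal dplyr sketch that reproduces it (mirroring the pattern used later for Period) could be:

```r
# Count the Response outcomes within each Kidhome level,
# ordered so the matrix below is filled row by row
Count_Kidhome <- Data %>%
  group_by(Kidhome, Response) %>%
  count() %>%
  arrange(Kidhome, Response)
```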
reaction_Kidhome <- matrix(Count_Kidhome$n, nrow=3, ncol=2, byrow=TRUE)
reaction_Kidhome
## [,1] [,2]
## [1,] 1043 220
## [2,] 743 101
## [3,] 44 2
dimnames(reaction_Kidhome) <- list(unique(Data$Kidhome), unique(Data$Response))
reaction_Kidhome
## 0 1
## 0 1043 220
## 1 743 101
## 2 44 2
chisq.test(reaction_Kidhome, simulate.p.value=TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_Kidhome
## X-squared = 15.978, df = NA, p-value = 0.001499
The result is a p-value of 0.001499, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that Kidhome could be a potential feature for the model to make predictions.
chisq.test(reaction_Teenhome, simulate.p.value=TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_Teenhome
## X-squared = 64.254, df = NA, p-value = 0.0004998
The result is a p-value of 0.0004998, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that Teenhome could be a potential feature for the model to make predictions.
For this particular feature, in order to perform the chi-square test, I had to divide Dt_Customer into 6-month periods.
# Create a new column with the 6-month period
Data$Period <- cut(Data$Dt_Customer, breaks = "6 months")
# Count the number of dates in each period
table(Data$Period)
##
## 2012-07-01 2013-01-01 2013-07-01 2014-01-01
## 470 561 578 544
Count_Period <- Data %>%
group_by(Period, Response) %>%
count(Period, sort = TRUE) %>%
arrange(Period)
Count_Period
## # A tibble: 8 × 3
## # Groups: Period, Response [8]
## Period Response n
## <fct> <dbl> <int>
## 1 2012-07-01 0 342
## 2 2012-07-01 1 128
## 3 2013-01-01 0 467
## 4 2013-01-01 1 94
## 5 2013-07-01 0 525
## 6 2013-07-01 1 53
## 7 2014-01-01 0 496
## 8 2014-01-01 1 48
reaction_Period <- matrix(Count_Period$n, nrow=4, ncol=2, byrow=TRUE)
reaction_Period
## [,1] [,2]
## [1,] 342 128
## [2,] 467 94
## [3,] 525 53
## [4,] 496 48
dimnames(reaction_Period) <- list(Lista_periodos, unique(Data$Response)) # Lista_periodos: the four period labels
reaction_Period
## 0 1
## 2012-07-01 342 128
## 2013-01-01 467 94
## 2013-07-01 525 53
## 2014-01-01 496 48
chisq.test(reaction_Period, simulate.p.value=TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_Period
## X-squared = 88.206, df = NA, p-value = 0.0004998
The result is a p-value of 0.0004998, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that Dt_Customer could be a potential feature for the model to make predictions.
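The reaction_Recency matrix used below (and the analogous reaction_* matrices for the other numeric features) is built by binning the numeric feature and cross-tabulating it against Response; that chunk is not shown, so here is a minimal sketch (the bin width of 25 is an illustrative choice):

```r
# Bin Recency (range 0-99) into equal-width intervals,
# then build the contingency table against Response
Data$Recency_bin <- cut(Data$Recency, breaks = seq(0, 100, by = 25),
                        include.lowest = TRUE)
reaction_Recency <- table(Data$Recency_bin, Data$Response)
```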
chisq.test(reaction_Recency, simulate.p.value=TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_Recency
## X-squared = 98.154, df = NA, p-value = 0.0004998
The result is a p-value of 0.0004998, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that Recency could be a potential feature for the model to make predictions.
chisq.test(reaction_NumWebPurchases, simulate.p.value=TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_NumWebPurchases
## X-squared = 20.326, df = NA, p-value = 0.0004998
The result is a p-value of 0.0004998, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that NumWebPurchases could be a potential feature for the model to make predictions.
chisq.test(reaction_NumCatalogPurchases, simulate.p.value=TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_NumCatalogPurchases
## X-squared = 98.22, df = NA, p-value = 0.0004998
The result is a p-value of 0.0004998, which indicates strong evidence against the null hypothesis; we therefore conclude that this result is statistically significant.
This indicates that NumCatalogPurchases could be a potential feature for the model to make predictions.
chisq.test(reaction_NumWebVisitsMonth, simulate.p.value=TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: reaction_NumWebVisitsMonth
## X-squared = 41.662, df = NA, p-value = 0.0004998
The result is a p-value of 0.0004998, this indicate a strong evidence against the null hypothesis, and therefore we conclude that this result is statistical significant.
This indicates that NumWebVisitsMonth could be a potential feature for the model to make predictions.